feat: Add Sherpa ONNX backend for ASR and TTS#8523
Conversation
seems I've completely missed this, sorry @richiejp!

No problem, I put it on hold while I was blocked on testing other stuff, but I can reboot it now. The main issue with this backend is testing: it has a huge feature/API surface.
👍 I see, yep. It would make sense then to try pointing Claude at https://github.com/mudler/LocalAI/tree/master/tests/e2e-backends, as we already have a "small" e2e suite that exercises backends directly via gRPC. This basically skips all the API e2e tests and jumps straight to the backend. It's usually very good at doing test scaffolding. Worth a shot.
Review comment on `func TestSherpaBackendStruct`:

```go
import (
	"os/exec"
	"path/filepath"
	"strings"
	"testing"

	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)

func TestSherpaBackendStruct(t *testing.T) {
```

> would be nice to use ginkgo here for consistency

Another excerpt from the diff:

```go
package main

/*
#cgo LDFLAGS: -lsherpa-onnx-c-api -lonnxruntime -lstdc++
```
Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same approach as opus/stablediffusion-ggml/whisper: a thin C shim (csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego can't reach directly: nested struct config writes, result-struct field reads, and the streaming TTS callback trampoline. The Go side uses opaque uintptr handles and purego.NewCallback for the TTS callback.

Supports:

- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel, prefixed by a streaming WAV header

Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR), silero-vad-sherpa, vits-ljs-sherpa.

E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e for the full realtime pipeline (VAD + STT + TTS).

Assisted-by: claude-opus-4-7-1M [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
I think it should be using e2e-backends now for gRPC-level tests. There are also e2e tests based on the realtime API and lower-level tests in the backend source, so we have a three-tier approach. Hopefully this is ready to go now. Next I'd want to use the real streaming for ASR and TTS in the realtime API. Also, there are still features in Sherpa that haven't been exposed. I have to say, though, that I am not a huge fan of ONNX; it seems harder to package than GGML-based backends if you want all of the GPUs to work, and I haven't even tried here, just CUDA and CPU.
The Sherpa backend can handle almost everything related to voice. So far we have VAD, ASR, and TTS. It should be relatively simple to add wake words, diarization, etc. However, I've reached a point where there is so much stuff to test that I'm just going to add one model we don't already have (Omnilingual ASR) and focus on getting the backend initially merged, then expand on testing.
Sherpa supports a lot of models we already have Python backends for, but at a fraction of the size because it is all based on ONNX. We also have ONNX backends already, but it's not clear that we have GPU acceleration for all of those.